EvoImp: Multiple Imputation of Multi-label Classification data with a genetic algorithm

Missing data is a prevalent problem that requires attention, as most data analysis techniques are unable to handle it. This is particularly critical in Multi-Label Classification (MLC), where only a few studies have investigated missing data. MLC differs from Single-Label Classification (SLC) by allowing an instance to be associated with multiple classes. Movie classification is a didactic example, since a movie can be “drama” and “biography” simultaneously. One of the most common missing data treatment methods is data imputation, which seeks plausible values to fill in the missing ones. In this scenario, we propose a novel imputation method based on a multi-objective genetic algorithm for optimizing multiple data imputations, called Multiple Imputation of Multi-label Classification data with a genetic algorithm, or simply EvoImp. We applied the proposed method in multi-label learning and evaluated its performance using six synthetic databases, considering various missing values distribution scenarios. The method was compared with other state-of-the-art imputation strategies, such as K-Means Imputation (KMI) and weighted K-Nearest Neighbors Imputation (WKNNI). The results showed that the proposed method outperformed the baselines in all the scenarios by achieving the best evaluation measures considering the Exact Match, Accuracy, and Hamming Loss. The superior results were consistent across different dataset domains and sizes, demonstrating the robustness of EvoImp. Thus, EvoImp represents a feasible solution to missing data treatment for multi-label learning.


Introduction
Missing data is ubiquitous in data analysis [1]. Its causes are diverse and depend on the application domain. These include drawbacks in data acquisition, measurement errors, sensor network problems, data migration failures, and unwillingness to respond to survey questions [2,3]. Since most data analysis algorithms and methods are not designed to deal with Missing Values (MVs), it is essential to treat them beforehand to guarantee the validity of the results; otherwise, the research conclusions may be impaired [1,4,5]. MVs are problematic because of the risk of bias, which depends on the type of missing data, the extent of the missingness, and how MVs are dealt with in the analyses [1]. Thus, it is critical to deal with missing data in a timely manner for intelligent decision-making [6].
Several techniques have emerged to address this problem [4,7,8]. Lin [4] comments that if the MV rate is less than 10% or 15%, the MVs can be removed without causing any significant loss to the mining process. However, this rule does not hold for every problem domain; small amounts of missing data may contain essential information that must be managed [9]. To address this issue, the literature suggests using missing data imputation methods, which replace missing data with plausible values. While this approach retains more data than deletion, it requires time to generate reasonable replacement values [10,11].
A naïve method for tackling the missing values issue is Single Imputation (SI). This method fills in each missing value with a single estimated value, often based on the mean, the median, or regression models [4]. While this approach simplifies the dataset and makes it easier to analyze, it can introduce bias and underestimate the uncertainty in the results [12,13]. To overcome this limitation, Rubin [14] introduced Multiple Imputation (MI), which has become a gold-standard strategy for handling missing data within the scientific community. In contrast with SI approaches, which seek a single solution, MI creates m complete solutions for the operational database, with m > 1. These solutions are analyzed separately and then combined to obtain the best solution [15,16]. To reduce the prediction error of the missing values, metaheuristics can be used to optimize the values to be imputed [15]. Notably, bioinspired strategies such as Genetic Algorithms (GAs) are prominent in optimizing solutions [17].
GAs were proposed by Holland [18]. They are optimization heuristics based on "the survival of the fittest", inspired by Charles Darwin's evolutionary theory. Regarding the use of GAs for multiple imputation, it is crucial to acknowledge the work of Garcia [19] and the MultImp algorithm [15]. The MultImp algorithm serves as the cornerstone of this research. It employs genetic algorithms for multiple imputation and was also applied to Multi-Label Classification (MLC) scenarios. The authors contend that data mining tasks, particularly those related to data classification, are notably sensitive to how MVs are handled. Furthermore, classification tasks are widely used to assess the accuracy (ACC) of imputation methods [5,11,20]. Consequently, the higher the classification accuracy, the more successful the imputation method is considered to be. However, only a few studies have employed MLC. In contrast to Single-Label Classification (SLC), or simply data classification, which associates an example with a single label, MLC allows an instance to be associated with multiple labels, thereby increasing the complexity of classification tasks [21,22]. Further details on this topic are given in the Background section.
Considering the importance of handling missing values in data analysis and the available solutions in the existing literature, this work presents an efficient algorithmic approach for multiple imputations applied to multi-label classification tasks. This method is named EvoImp, a combination of "evolutionary" and "imputation". Furthermore, the name is inspired by MultImp [15], which serves as the foundation for our algorithm and has shown promise in its preliminary stages for multiple imputations with missing data. EvoImp enhances the parameterization of MultImp to maximize its imputation capabilities and explores new configurations for computational experiments.
We conducted a rigorous benchmarking process to validate the proposed method's performance using diverse multi-label datasets. We compared EvoImp with well-established imputation methods documented in the literature. These datasets were systematically subjected to six missing value rates to simulate the Missing Completely At Random (MCAR) mechanism. The outcomes of these experiments were evaluated using five distinct classifiers. This comprehensive evaluation provides insights into the strengths and potential limitations of EvoImp when applied to real-world multi-label classification scenarios. By addressing the challenges associated with missing data in this context, our work aims to advance multi-label classification and the broader field of data analysis.
Accordingly, the remainder of this paper is organized as follows. The section "Background" presents a preliminary background. The section "EvoImp-Proposed Method" presents the proposed method. The section "Computational Experiments" details the experimental setup. The performance of the method and its comparison with other data imputation techniques are presented in sections "Results and Analysis" and "Discussion". Finally, section "Conclusion and Suggestions for Future Work" summarizes the paper and points out potential directions for future exploration.

Multi-label Classification and classical approaches to handling MVs
In single-label classification problems, a set of class labels is predetermined, and each object must be associated with one and only one label [23]. Formally, let X denote the input/feature space and y denote the class value, where y ∈ L and L is the output space (a set of disjoint class labels). In this case, each sample is strictly associated with a single class label [24,25]. However, there are increasingly more contexts in which data may belong to more than one class label. This classification condition is referred to as Multi-Label Classification. Initially, MLC primarily focused on tasks such as text categorization, protein function classification, music categorization, semantic scene classification, and medical diagnosis [23,24,26]. Recently, new applications have emerged in Computer Vision, Natural Language Processing, and Data Mining, including Video Annotation, Legal Text Mining, and User Profiling [27]. According to [25,28], similar to SLC, MLC is represented by X and y, where each sample x ∈ X is assigned a subset of the output space (a set of non-disjoint class labels). Table 1 illustrates a toy example depicting the difference between SLC and MLC, adapted from [29]. The data in Table 1 comprise 5 instances (x_1, x_2, x_3, x_4, x_5) and 3 labels (y_1, y_2, y_3).
Table 1a illustrates the SLC scenario, where five data instances (x_1 to x_5) are each strictly associated with a single label (y_1 to y_3). For instance, x_1 is associated with y_1, x_2 is associated with y_2, and so on. On the other hand, MLC allows data instances to be associated with multiple labels simultaneously. Table 1b demonstrates the MLC scenario, where the same five data instances (x_1 to x_5) can have multiple labels assigned to them. For example, x_1 is associated with both y_1 and y_2, x_2 is associated with both y_2 and y_3, and so forth. This distinction highlights how SLC restricts each data instance to a single label, while MLC permits instances to belong to multiple labels simultaneously, making it more suitable for scenarios where objects or data points can be associated with different classes.
Although the difference is subtle in theory, MLC tends to be more challenging in practice. Gonçalves et al. [23] and Sá et al. [25] enumerated the following reasons for this:
• The number of possible classes of a given instance (output space) in MLC grows exponentially with the number of labels. Therefore, when a problem has L distinct labels, the size of the output space in MLC is 2^L (combinations of labels), while it is only L in SLC;
• An MLC algorithm must consider whether or not there is a correlation between labels. Accounting for this kind of correlation is an essential step to ensure the effectiveness of several MLC processes [24,30,31];
• MLC systems performance evaluation uses different metrics than those traditionally used in SLC [32]. In SLC, the classification of a new instance can be either correct or wrong. On the other hand, in MLC, the result can be partially correct. This occurs when the classifier predicts some correct labels but includes some incorrect predictions or even omits a label that should be predicted. This problem requires careful attention, since some metrics follow contrasting criteria to define what a good MLC prediction is [25,33];
• Unlike SLC problems, which traditionally involve the analysis of relational (structured) data, MLC applications typically address big data tasks, which involve semi-structured or unstructured data [24,34].
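The first reason above can be made concrete with a small sketch. It enumerates the candidate outputs for the three labels of the toy example; the helper names are hypothetical, and counting the empty label set as a valid MLC output is an assumption made so that the total matches 2^L.

```python
from itertools import combinations


def single_label_outputs(labels):
    """In SLC each instance receives exactly one label, so there are L outputs."""
    return [frozenset([label]) for label in labels]


def multi_label_outputs(labels):
    """In MLC any subset of the labels is a candidate output (assumption:
    the empty set is included, which gives 2^L subsets in total)."""
    outputs = []
    for size in range(len(labels) + 1):
        outputs.extend(frozenset(combo) for combo in combinations(labels, size))
    return outputs


labels = ["y1", "y2", "y3"]
slc = single_label_outputs(labels)
mlc = multi_label_outputs(labels)
print(len(slc), len(mlc))  # L versus 2^L
```

With only three labels the gap is small, but the MLC output space doubles with every additional label.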
All these challenges have amplified the complexity associated with handling MVs. Nevertheless, finding studies that relate MLC and MVs is not straightforward, as demonstrated in [4,8,17].
In this context, we emphasize a limited number of studies that specifically address the issue of missing labels [35,36], that is, studies that focus on predicting an unknown label. Wang et al. [35] present a multi-label feature selection method that considers feature interaction. To that end, the authors use the definitions of multi-label neighborhood information entropy and multi-label neighborhood mutual information to mitigate the negative impact of missing labels. Cheng, Song & Qian [36] address missing labels by leveraging label correlations and implementing a two-level kernel extreme learning machine autoencoder. The authors verified the proposed method on both missing-label and complete-label datasets. Since these studies primarily focus on missing labels rather than missing values (predictive features), to the best of our knowledge, there is no work addressing missing values in the predictive feature space in an MLC scenario. Thus, this constitutes one of the contributions of the present study.

Bio-inspired computation for the handling of MVs
Tran, Zhang, and Andreae [37] proposed a data imputation method based on genetic programming, called GPMI. An MI strategy was applied in this method, and the missing values were estimated using regression techniques. GPMI was compared with seven imputation methods in an experiment carried out on eight datasets with six different missing value rates (5%, 10%, 20%, 30%, 40%, and 50%), using MCAR as the missing data mechanism. The classifier's accuracy was the performance measure adopted. The results suggest that the proposed method performed better than all the compared methods. According to the authors, genetic programming was primarily responsible for these results because the algorithm initially used random samples to fill the gaps before they were submitted to the genetic processes. The results confirmed that strategies based on evolutionary algorithms are feasible alternatives for missing values treatment.
Shahzad, Rehman, and Ahmed, in their study "Missing Data Imputation using Genetic Algorithm for Supervised Learning" [38], employed a GA to search for plausible values for missing data imputation. An interesting strategy adopted in this study is the use of information gain to observe how solutions are found as the search progresses. In an experiment with five datasets that originally contained missing values, the proposed method was compared with other imputation approaches: the average, lowest value, highest value, zero, and MI. They used the following performance measures: predictive accuracy, precision, recall, F-measure, and the area under the Receiver Operating Characteristic (ROC) curve, with the following classifiers: NB-tree, PART, JRIP, Naive Bayes, KNN, and J48. The authors noted that the GA-based method showed promising results and worked well in datasets with a high percentage of missing values.
In [39], an algorithm called MOGAImp was proposed for the multiple imputation of datasets based on genetic algorithms. One of the interesting strategies of this work is the application of a multiobjective approach, which until then had not been adopted in the literature for the performance analysis of imputation techniques. This approach involves simultaneously employing two or more evaluation measures. It is justified by the fact that performance measures can conflict: while one increases, another may decline. In the case of MOGAImp, two conflicting measures were used together with the Pareto front: the classifier accuracy and the predictive accuracy of the imputation method, the latter calculated using the Normalized Root-Mean-Square Error (NRMSE).
Another critical factor in the study conducted by [39] concerns population initialization, which employs a pool of candidate solutions for each attribute. The solution pool groups all the values that occur in the dataset for the attribute that has a missing value (by lexicographically comparing two strings in the case of categorical variables). The method was experimentally compared with other well-known techniques in the literature, employing benchmarking on several databases with missing values. The results demonstrated that the method achieved competitive performance and, according to the authors, demonstrated potential for real-world applications. However, high computational power is required for handling the MVs individually with MOGAImp and through the solution pool. Even so, the solution pool is an excellent way to mix genetic material. Therefore, it has been adopted in EvoImp as the basis for the mutation operations.
In [15], the authors created a scheme based on genetic algorithms, which served as a baseline for developing and analyzing the method employed in this study. The strategy, named MultImp, performs multiple imputations of datasets in a multi-label classification setting. In that study, the authors conducted experiments using four databases that were initially complete. Subsequently, 5% of missing values were introduced through the MCAR mechanism. Binary Relevance (BR) was employed as the multi-label classifier, with C4.5 as the base classifier. In the test scenario, MultImp was compared with two other imputation methods (K-Nearest Neighbors Imputation, KNNI, and Most Common, MC) and evaluated lexicographically using the following measures: Exact Match (EM), Accuracy, and Hamming Loss (HL). The preliminary results of this study were promising, particularly in the case of EM, where the method performed better on all the datasets used, which justified adopting the lexicographical approach.
For a comprehensive summary of the works discussed in this section, we have provided a detailed table in our supplementary material, available on the project's GitHub repository (https://github.com/jacobjr/EvoImp).

EvoImp-Proposed method
Since EvoImp is based on a genetic algorithm, the following descriptions explain how EvoImp was mapped and configured within the GA structure: a) the codification of individuals, b) the formation of the initial population, c) the configuration of genetic operators, and d) the definition of the fitness function. Fig 1 presents a toy example of this structure, which will be detailed in the following subsections.

Individual encoding and population initialization
The individual encoding of EvoImp is as follows: the variables in the datasets represent individual genes. Genes initially marked with "?" represent the missing values (Fig 1(a)). Each individual is represented by a completed (fully imputed) instance of the database (Fig 1(b)). The phenotype consists of the imputed values, while the genotype represents these values in binary form, as illustrated in Fig 1(c).
The initial population is generated by five simple imputation methods, each of which produces one individual (Fig 1(d)). All imputation methods are well-known and established in the literature [7]: K-means Clustering Imputation (KMI), KNNI, WKNNI, Concept Most Common (CMC), and MC. The parameters for the KNNI, WKNNI, and KMI methods followed the recommendations of [7]. This kind of population initialization was adopted in EvoImp to reduce the search space and, hence, the computational costs.
The methods employed are as follows [7]:
• KNNI: Whenever there is a missing value, the K nearest neighbors of the instance containing the MV are determined. For nominal attributes, the most common value among the K nearest neighbors is imputed. For numerical attributes, imputation is performed by calculating the average of the neighboring values;
• WKNNI: This technique determines the distances to the K nearest neighbors and weights each neighbor according to its distance. After this, the KNNI process is applied with the weighted neighbors;
• KMI: This technique divides a database into clusters based on their features. Once this has been done, the K-nearest neighbors technique is applied to decide which value should be imputed;
• MC: In this method, the most common value is adopted for imputation in nominal attributes, and the average of all values of the corresponding attribute is adopted in the case of numerical attributes;
• CMC: This method does the same as MC but only considers the instances that belong to the same class as the instance with the MV.
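To illustrate two of the methods listed above, the following is a minimal sketch of MC and KNNI in plain Python. It is not the KEEL implementation used in the experiments: the function names, the unnormalized heterogeneous distance, and the use of `None` as the missing marker are all assumptions made for this example.

```python
import math
from collections import Counter

MISSING = None  # hypothetical marker for a missing value


def is_numeric(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def column_fill_value(values):
    """Mean for numerical values, most common value for nominal ones.
    Assumes at least one observed value is available."""
    observed = [v for v in values if v is not MISSING]
    if all(is_numeric(v) for v in observed):
        return sum(observed) / len(observed)
    return Counter(observed).most_common(1)[0][0]


def mc_impute(rows):
    """MC: fill every gap with the column-wide mean or most common value."""
    fills = [column_fill_value(col) for col in zip(*rows)]
    return [[fills[j] if v is MISSING else v for j, v in enumerate(row)]
            for row in rows]


def distance(a, b):
    """Simplified distance over the attributes observed in both rows
    (squared difference for numbers, 0/1 mismatch for nominal values)."""
    total = 0.0
    for x, y in zip(a, b):
        if x is MISSING or y is MISSING:
            continue
        if is_numeric(x) and is_numeric(y):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)


def knn_impute(rows, k=2):
    """KNNI: fill each gap from the k nearest rows that observe that attribute."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            donors = [r for idx, r in enumerate(rows)
                      if idx != i and r[j] is not MISSING]
            donors.sort(key=lambda r: distance(row, r))
            imputed[i][j] = column_fill_value([r[j] for r in donors[:k]])
    return imputed


rows = [
    [1.0, "a", 10.0],
    [1.2, "a", MISSING],
    [5.0, "b", 50.0],
    [5.1, MISSING, 52.0],
]
mc_result = mc_impute(rows)
knn_result = knn_impute(rows, k=1)
```

On this toy data MC fills the nominal gap with the globally most common value, whereas KNNI takes it from the closest row, which is the kind of disagreement between initial individuals that the evolutionary search can later exploit.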
In contrast to MOGAImp [39], which employs random initialization of the initial population, the proposed method optimizes simple imputations through evolutionary processes to perform multiple imputations. This approach reduces the search space, which is particularly beneficial in scenarios where the computational cost of the objective function calculations is critical, such as multi-label classification.
It is also noteworthy that the presented work has two novel contributions: 1) using simple imputation methods as a priori solutions, reducing the search space; 2) treating missing values in the multi-label scenario. To the best of our knowledge, there is no similar study in the literature.

Genetic operators
Individual selection involves a tournament in which two (or more) members of the previous population are drawn, and the better one is chosen based on its fitness value, as illustrated in Fig 1(e). This procedure is repeated until the required number of individuals for the current generation is obtained. The best individual is always selected through elitism [40].
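A minimal sketch of tournament selection with elitism is given below. The function names and the scalar fitness values are hypothetical; EvoImp ranks individuals lexicographically, so a single "higher is better" number is a simplification used only to keep the example short.

```python
import random


def tournament_select(population, fitness, rng, size=2):
    """Draw `size` distinct individuals at random and return the fittest one."""
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda idx: fitness[idx])
    return population[winner]


def select_parents(population, fitness, n_parents, rng):
    """Keep the best individual (elitism), then fill the rest by tournaments."""
    elite_idx = max(range(len(population)), key=lambda idx: fitness[idx])
    selected = [population[elite_idx]]
    while len(selected) < n_parents:
        selected.append(tournament_select(population, fitness, rng))
    return selected


rng = random.Random(0)
population = ["A", "B", "C", "D"]
fitness = [0.10, 0.40, 0.25, 0.30]  # hypothetical values, higher is better
parents = select_parents(population, fitness, n_parents=4, rng=rng)
```

Because each tournament compares two distinct individuals, the worst individual of the population can never be selected, while the elite is always present.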
In the literature, numerous methods for parameter tuning and control have been proposed and analyzed. [41] describes some of these methods and discusses various trends and challenges in the field. Specifically, [42] conducted experiments to find appropriate settings for these parameters when applying evolutionary algorithms to a multi-objective problem class. They concluded that determining the value of the scaling factor can be difficult and is highly dependent on the specific problem. Considering these findings, initial tests were conducted to define the parameters used in our study. In line with the work of [42], the initial crossover rate was restricted to the interval [0.8, 1.0], following the standard proposal for non-separable problems like the one tackled in our research. EvoImp applies crossover to 80% of the individuals using an n-point crossover operator [43], as shown in Fig 1(f). This choice is also consistent with the work of [44].
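The n-point crossover operator can be sketched as follows. This is an illustrative version that treats an individual as a flat list of genes; the function name, the way cut points are drawn, and the flat encoding are assumptions, not details taken from the EvoImp implementation.

```python
import random


def n_point_crossover(parent_a, parent_b, n_points, rng):
    """Exchange alternating segments between two parents at n random cut points."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of genes")
    if not 1 <= n_points < len(parent_a):
        raise ValueError("n_points must be between 1 and len(parent) - 1")
    # Distinct cut positions strictly inside the chromosome.
    cuts = sorted(rng.sample(range(1, len(parent_a)), n_points))
    child_a, child_b = [], []
    swap, start = False, 0
    for cut in cuts + [len(parent_a)]:
        source_a, source_b = (parent_b, parent_a) if swap else (parent_a, parent_b)
        child_a.extend(source_a[start:cut])
        child_b.extend(source_b[start:cut])
        swap = not swap  # alternate the donor parent after every cut
        start = cut
    return child_a, child_b


rng = random.Random(1)
child_a, child_b = n_point_crossover([0] * 8, [1] * 8, n_points=3, rng=rng)
```

Using all-zero and all-one parents makes the exchanged segments visible: each child switches donor exactly three times, and the two children are complementary.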
The mutation process is performed on 20% of the individuals, chosen randomly, except for the best one. For each individual to be mutated, the imputed value is exchanged for a candidate value. The mutation is applied only to genes that correspond to missing values. To accomplish this, each attribute in the dataset has a set of candidate solutions, as shown in Table 2. This set is formed by considering all the values that occur for that attribute in the evaluated dataset.
Table 2a displays a toy dataset containing five records and four attributes: "Year", "Gender", "Age", and "Have Credit". Some values in the dataset are missing and are represented by "?". Table 2b lists the possible values for each attribute. For example, the "Year" attribute can have values 1998, 2005, or 2010, and the "Gender" attribute can have values M or F. The same reasoning is applied to the other attributes.
Lobato et al. [39] adopted this technique to initiate the first MOGAImp population. The mutation operator was not implemented in MultImp. Its absence caused premature convergence, limiting the method's robustness. This operator is one of the main differences between MultImp and EvoImp. In other words, the proposed method implements a strategy to avoid local optima.
The algorithm's search and optimization process occurs over a predetermined number of generations. The population first goes through a growth phase, starting with as many individuals as imputation methods adopted in the population initialization and increasing through crossover. This strategy aims to provide population diversity. In the second phase, the population is gradually reduced until it reaches the initial population size again, allowing the analysis to choose the best solution qualitatively.
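The candidate pool and the mutation step described above can be sketched as follows. Only the "Year" and "Gender" values come from Table 2; the "Age" and "Have Credit" values, the function names, and the choice of mutating a single randomly chosen gap per call are assumptions made for illustration.

```python
import random

MISSING = "?"


def build_solution_pool(rows, attributes):
    """Collect, per attribute, every value observed in the dataset."""
    pool = {}
    for j, name in enumerate(attributes):
        observed = {row[j] for row in rows if row[j] != MISSING}
        pool[name] = sorted(observed, key=str)
    return pool


def mutate(individual, original, attributes, pool, rng):
    """Replace one imputed value with a candidate drawn from the attribute's pool.
    Only positions that were missing in the original dataset can change."""
    mutant = [list(row) for row in individual]
    gaps = [(i, j) for i, row in enumerate(original)
            for j, value in enumerate(row) if value == MISSING]
    if not gaps:
        return mutant
    i, j = rng.choice(gaps)
    mutant[i][j] = rng.choice(pool[attributes[j]])
    return mutant


attributes = ["Year", "Gender", "Age", "Have Credit"]
original = [  # toy records; Age and Have Credit values are hypothetical
    [1998, "M", 25, "Yes"],
    [2005, "?", 31, "No"],
    ["?", "F", 40, "Yes"],
    [2010, "M", "?", "No"],
    [2005, "F", 31, "?"],
]
individual = [  # one hypothetical imputed version of `original`
    [1998, "M", 25, "Yes"],
    [2005, "M", 31, "No"],
    [2005, "F", 40, "Yes"],
    [2010, "M", 31, "No"],
    [2005, "F", 31, "No"],
]
pool = build_solution_pool(original, attributes)
mutant = mutate(individual, original, attributes, pool, random.Random(3))
```

Restricting the change to originally missing positions keeps the observed data intact, so mutation only explores alternative imputations.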

Fitness function
As mentioned earlier, the method was evaluated in an MLC scenario. For this, EvoImp performs a classification process on each individual. The goal is to analyze the performance of the classifier and, consequently, the efficiency of the data imputation. Three performance measures were adopted to evaluate the classifier, as with MultImp: Exact Match, Accuracy, and Hamming Loss. The notation used by [15,45] was adopted to describe these measures: (i) n: number of instances in the test set; (ii) q: number of labels; (iii) Y_i: set of true labels for instance i; and (iv) Z_i: set of predicted labels for instance i.
• Exact Match calculates, in a binary fashion, whether all the labels of an instance are predicted correctly. This measure, expressed in Eq 1, is considered strict because it ignores partial predictions.
• Accuracy also counts the correctly predicted labels of an instance, but in this case partial predictions are taken into account. Eq 2 expresses this measure.
• Hamming Loss, in contrast to Accuracy, evaluates the classifier's performance by averaging the incorrect predictions. Eq 3 describes this measure.
These measures were used in lexicographical order; in other words, this approach ranks all the problem's objectives by priority and then tries to satisfy them in that order [46]. Thus, the fitness (f) of a solution can be expressed as in Eq 4, where n′ is the number of objectives defined and each component f^i is an optimization goal. Given two fitness evaluations f_1 and f_2 and a precision threshold t, the lexicographic relations between them (denoted ≺_l and ⪯_l) can be defined as in [47].
Eq 5 defines f_1 ≺_l f_2, which means that f_1 is lexicographically less than f_2. This relationship holds when there exists an index k in the range [0, n′) ∩ ℕ such that the difference between f_1^k and f_2^k is greater than or equal to t, which ensures that the k-th components differ by at least t. In addition, the absolute differences between the corresponding components f_1^i and f_2^i must be less than t for all i less than k. In essence, this relation means that f_1 is superior to f_2 in terms of some objectives. Eq 6 defines equality in lexicographical order (f_1 =_l f_2). This occurs when the absolute differences between the corresponding components f_1^i and f_2^i are less than t for all i in the range [0, n′) ∩ ℕ. In other words, f_1 and f_2 are considered equal regarding their performance across objectives. Finally, Eq 7 defines f_1 ⪯_l f_2, which means that f_1 is either less than or equal to f_2 in lexicographical order. It combines the ≺_l and =_l relations, indicating that f_1 is either better than or equal to f_2 in terms of the defined objectives.
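For reference, the measures and relations described in this subsection can be written out as below. This is a reconstruction from the notation and the prose above, not a copy of the paper's typeset equations: Eq 1 to Eq 3 follow the standard multi-label definitions, I(·) is the indicator function, Δ is the symmetric set difference, and the direction of the difference in Eq 5 is an assumption.

```latex
\begin{align}
\text{Exact Match} &= \frac{1}{n}\sum_{i=1}^{n} I\left(Y_i = Z_i\right) \\
\text{Accuracy} &= \frac{1}{n}\sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \\
\text{Hamming Loss} &= \frac{1}{n}\sum_{i=1}^{n} \frac{|Y_i \,\Delta\, Z_i|}{q} \\
f &= \left(f^{0}, f^{1}, \dots, f^{n'-1}\right) \\
% Eq 5: the sign of the difference (f_2 - f_1) is an assumption
f_1 \prec_l f_2 &\iff \exists\, k \in [0, n') \cap \mathbb{N} :\;
    f_2^{k} - f_1^{k} \ge t \;\wedge\;
    \forall\, i < k :\; \left|f_1^{i} - f_2^{i}\right| < t \\
f_1 =_l f_2 &\iff \forall\, i \in [0, n') \cap \mathbb{N} :\;
    \left|f_1^{i} - f_2^{i}\right| < t \\
f_1 \preceq_l f_2 &\iff f_1 \prec_l f_2 \;\vee\; f_1 =_l f_2
\end{align}
```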
These equations are used to rank and compare solutions or fitness evaluations in optimization problems, considering the objectives, prioritization, and performance. The lexicographical order approach allows for precise, multi-objective optimization when there are multiple criteria or objectives to be considered. Because the threshold t is introduced, this formulation differs from the pure mathematical lexicographic relation: it permits the decision maker to choose the precision with which two fitness evaluations are compared. This relation allows the solutions of EvoImp to be ranked as follows: 1. The EM score is evaluated; 2. If two or more individuals have matching EM scores, the ACC evaluation is checked; 3. If the tie remains, the HL evaluation is used. This approach allows different performance measures to be combined into a single evaluation [45]. It is similar to the classical lexicographical approach, but since evolutionary algorithms are adopted, local optima can be avoided [47].
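The ranking procedure can be illustrated with a short sketch that computes the three measures on label sets and compares two fitness vectors with a threshold t. The function names, the default value of t, the negation of Hamming Loss so that every objective is "larger is better", and the value 1.0 for an empty union in Accuracy are assumptions of this example.

```python
def exact_match(true_sets, pred_sets):
    """Fraction of instances whose predicted label set equals the true one."""
    return sum(y == z for y, z in zip(true_sets, pred_sets)) / len(true_sets)


def accuracy(true_sets, pred_sets):
    """Average Jaccard overlap between true and predicted label sets."""
    total = 0.0
    for y, z in zip(true_sets, pred_sets):
        union = y | z
        total += len(y & z) / len(union) if union else 1.0  # assumption for empty sets
    return total / len(true_sets)


def hamming_loss(true_sets, pred_sets, q):
    """Average fraction of the q labels that are wrongly predicted."""
    errors = sum(len(y ^ z) for y, z in zip(true_sets, pred_sets))
    return errors / (q * len(true_sets))


def fitness_vector(true_sets, pred_sets, q):
    """Objectives in priority order (EM, ACC, HL), all oriented so larger is better."""
    return (exact_match(true_sets, pred_sets),
            accuracy(true_sets, pred_sets),
            -hamming_loss(true_sets, pred_sets, q))


def lexicographic_compare(f1, f2, t=1e-3):
    """Return 1 if f1 is better, -1 if f2 is better, 0 if tied within t."""
    for a, b in zip(f1, f2):
        if abs(a - b) >= t:
            return 1 if a > b else -1
    return 0


Y = [{"y1", "y2"}, {"y2", "y3"}, {"y1"}]
Z = [{"y1", "y2"}, {"y2"}, {"y1", "y3"}]
fit = fitness_vector(Y, Z, q=3)
```

In this sketch a difference smaller than t on Exact Match counts as a tie, so the decision falls through to Accuracy and then to Hamming Loss, mirroring steps 1 to 3 above.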

The EvoImp algorithm
As shown in Algorithm 1, EvoImp begins the execution by creating and evaluating individuals for the initial population. The datasets are initially imputed using simple imputation methods: KNNI, CMC, MC, KMI, and WKNNI (lines 1-5). Afterward, the population is evaluated and ranked based on each individual's performance (line 6). While the stopping criterion (e.g., the number of generations) is not met, the algorithm applies the genetic operators.
Algorithm 1: EvoImp
The elitist individual is always passed on to the next generation (line 8). The selection is performed using the tournament selection operator (line 10). Two individuals are randomly drawn in this process. These two parents exchange genetic material using a crossover operator. These steps are repeated until the population is complete. Afterward, the mutation follows the established rate (lines 15-20). The new population is arranged, and the iterative process continues until the stopping criterion is reached. The algorithm returns the individual that achieves the best performance (line 23).
In summary, EvoImp adopts the parameterization of MultImp [15], except for the mutation operator, as pointed out earlier. Besides, we also corrected bugs and optimized the code, bearing in mind maintainability and reuse. Moreover, we implemented the lexicographic strategy and expanded the computational tests, broadening the technical-scientific contribution of the present work.
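The control flow described above can be sketched as a generic loop. This is not Algorithm 1 itself: the function signature, the fixed population size (EvoImp grows and then shrinks its population), the per-individual crossover and mutation probabilities, and the bit-string toy problem are all assumptions chosen to keep the example self-contained.

```python
import random


def evolve(initial_population, evaluate, crossover, mutate, generations,
           crossover_rate=0.8, mutation_rate=0.2, seed=0):
    """Elitist GA loop: rank, keep the best, select by tournament, cross, mutate."""
    rng = random.Random(seed)
    population = list(initial_population)
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        offspring = [ranked[0]]  # elitism: the best individual always survives
        while len(offspring) < len(population):
            parent_1 = max(rng.sample(ranked, 2), key=evaluate)  # tournament
            parent_2 = max(rng.sample(ranked, 2), key=evaluate)
            if rng.random() < crossover_rate:
                child = crossover(parent_1, parent_2, rng)
            else:
                child = list(parent_1)
            offspring.append(child)
        for idx in range(1, len(offspring)):  # index 0 is the elite, never mutated
            if rng.random() < mutation_rate:
                offspring[idx] = mutate(offspring[idx], rng)
        population = offspring
    return max(population, key=evaluate)


# Toy stand-ins: in EvoImp an individual is an imputed dataset and `evaluate`
# would train a multi-label classifier; here individuals are bit lists.
def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip_one(individual, rng):
    pos = rng.randrange(len(individual))
    return individual[:pos] + [1 - individual[pos]] + individual[pos + 1:]


initial = [
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
]
best = evolve(initial, sum, one_point, flip_one, generations=30)
```

Because the elite is copied unchanged into every generation, the best fitness can never decrease. In a real setting the fitness of each individual would be cached, since every evaluation involves training and testing a classifier.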

Datasets
The experiments were designed using six multi-label datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/). The number of datasets is consistent with the literature review conducted by Chiu et al. [17], which mapped 48 papers related to experiments in the context of data imputation. That review shows that most papers (77%) use up to six datasets in their experiments. Another interesting finding of Chiu et al. [17] is that the UCI Machine Learning Repository is the most used source. Regarding the characteristics of the datasets, most studies use small-scale datasets, which contain fewer than 15 attributes and 800 instances. Table 3 shows the datasets used and their features.
Regarding multi-label datasets, the works of [35,48] must be mentioned. These studies, like the present one, used datasets obtained from the UCI repository and formatted using the Mulan library (http://mulan.sourceforge.net/). The datasets used in these papers have similar characteristics (cardinality, density, and number of instances) to those chosen in this paper. This observation highlights the consistency of the experimental setup with the state of the art and the potential applicability of EvoImp to real-world problems.

Experimental setup
In the experiments, the missing values were artificially added to each dataset at the following rates: 5%, 10%, 15%, 20%, 25%, and 30%. This "amputation" process was carried out using the MCAR mechanism, as described in Santos (2019) [49]. The complete experimental configuration consisted of 36 datasets with missing data, and these datasets underwent a comparative evaluation. This evaluation involved five simple imputation methods: KNNI, CMC, MC, KMI, and WKNNI.
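A minimal sketch of MCAR amputation is shown below. It is not the procedure of [49]: the function name, the "?" marker, and the decision to apply the rate over all feature cells of the dataset (rather than per attribute) are assumptions. Under MCAR, every cell has the same probability of being removed, independently of any observed or unobserved value.

```python
import random


def ampute_mcar(rows, rate, seed=0, missing="?"):
    """Replace a fraction `rate` of the feature cells, chosen uniformly at
    random, with the missing marker. The input rows are left untouched."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    n_missing = round(rate * len(cells))
    amputed = [list(row) for row in rows]
    for i, j in rng.sample(cells, n_missing):
        amputed[i][j] = missing
    return amputed


# Toy complete dataset: 10 instances with 4 predictive features each.
complete = [[float(i + j) for j in range(4)] for i in range(10)]
rates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
amputed_versions = {rate: ampute_mcar(complete, rate, seed=42) for rate in rates}
```

Applying the six rates to each of the six complete datasets yields the 36 amputed datasets mentioned above; only predictive features are amputed here, and the labels are assumed to stay complete.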
The following classification methods were used for the multi-label learning tasks: Binary Relevance (BR), Hierarchy of Multi-label classifiER (HOMER), Multi-Label K-Nearest Neighbors (ML-KNN), Classifier Chains (CC), and Ensembles of Classifier Chains (ECC) [21,50]. K-fold cross-validation was used for the classification model's evaluation (learning and testing). Table 4 summarizes the overall parameters used in the experiments.
Regarding the simple imputation methods, the parameters recommended by [7] were used. The mutation rate (MR) chosen is higher than the typical usage rates because the starting point is not random. Since the initial population is obtained by other methods, parameterization experiments demonstrated that a higher MR yields better results, providing fast convergence. The entire experimental setup and the obtained results are available as supplementary material on the project's GitHub (https://github.com/jacobjr/EvoImp).
• The simple imputation methods used for forming the first population of EvoImp and in the comparative analyses are implemented in the KEEL software (http://www.keel.es/) [53].
It is noteworthy that the GA used in EvoImp was fully implemented by the authors, despite KEEL providing a framework for evolutionary computation. This design decision aimed to give us more control over the experiments. Computational complexity is another crucial aspect to consider in implementing the proposed method. It plays a vital role in determining the feasibility and efficiency of applying bio-inspired techniques to solve optimization problems. Reducing computational complexity enhances the algorithm's applicability and scalability. As a result, the method becomes more suitable for handling larger datasets and complex optimization landscapes, particularly in multi-label classification tasks. More detailed information about EvoImp's computational complexity can be found in the supplementary materials on the project's GitHub repository.

Results and analysis
This section examines the results obtained from the computational experiments. The following tables show the differences in performance between the methods for each percentage of missing values analyzed (5%, 10%, 15%, 20%, 25%, and 30%). The best results are highlighted in bold for easy viewing. The metrics are presented as Exact Match (↑), Accuracy (↑), and Hamming Loss (↓), where (↑) indicates that higher values reflect better performance and (↓) indicates that lower values represent better performance.
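For reference, the three measures can be computed as in the sketch below. It assumes the usual example-based definitions from the multi-label literature (Exact Match as the share of instances with a perfectly predicted label set, Accuracy as the mean Jaccard index, Hamming Loss as the share of wrong label slots); the function names and the toy label sets are illustrative and not taken from the paper's code.

```python
from typing import List, Set

LabelSets = List[Set[str]]


def exact_match(y_true: LabelSets, y_pred: LabelSets) -> float:
    """Fraction of instances whose predicted label set equals the true set."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true: LabelSets, y_pred: LabelSets) -> float:
    """Mean per-instance Jaccard index |T & P| / |T | P| (1.0 if both are empty)."""
    scores = []
    for t, p in zip(y_true, y_pred):
        union = t | p
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def hamming_loss(y_true: LabelSets, y_pred: LabelSets, n_labels: int) -> float:
    """Fraction of (instance, label) slots that are predicted incorrectly."""
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * n_labels)


# Toy example with five possible labels and two instances.
y_true = [{"a", "b", "c", "d", "e"}, {"a"}]
y_pred = [{"a", "b", "c", "d"}, {"a"}]

em = exact_match(y_true, y_pred)      # only the second instance matches fully
acc = accuracy(y_true, y_pred)        # (4/5 + 1) / 2
hl = hamming_loss(y_true, y_pred, 5)  # 1 wrong slot out of 10
```

In this toy case the first instance gets four of its five labels right, so it contributes nothing to Exact Match but 0.8 to Accuracy, which illustrates why Exact Match is the strictest of the three measures.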

Binary relevance
In the learning performed with the BR classifier, the results showed that EvoImp was numerically superior (Table 5). In the EM evaluation, EvoImp outperformed its competitors in 35 of the 36 datasets evaluated (97.22%). Regarding the Accuracy evaluation measure, the proposed method outperformed the others in 18 datasets (50%). Finally, considering the HL, EvoImp outperformed the baseline methods in 16 datasets (44.44%).
It is essential to highlight that the lexicographic order adopted in EvoImp prioritizes the evaluation with EM, as mentioned in the Subsection "Fitness Function". This prioritization explains the lower performance on the ACC and HL metrics for the binary relevance classifier.
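The effect of a lexicographic ordering can be illustrated with a short sketch. It assumes the priority EM, then ACC, then HL (only the priority of EM is stated here; the order of the remaining two is an assumption), and the `Fitness` type, the helper names, and the numbers are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class Fitness(NamedTuple):
    em: float   # Exact Match, higher is better
    acc: float  # Accuracy, higher is better
    hl: float   # Hamming Loss, lower is better


def lexicographic_key(f: Fitness) -> Tuple[float, float, float]:
    # Assumed priority: EM first, then ACC, then HL (negated so larger is better).
    return (f.em, f.acc, -f.hl)


def best(candidates: List[Fitness]) -> Fitness:
    """Pick the candidate that wins under the lexicographic ordering."""
    return max(candidates, key=lexicographic_key)


# Illustrative values only, not results from the experiments.
candidates = [
    Fitness(em=0.40, acc=0.70, hl=0.10),
    Fitness(em=0.42, acc=0.65, hl=0.12),
    Fitness(em=0.42, acc=0.60, hl=0.08),
]
winner = best(candidates)
```

Here the second candidate wins: it ties for the highest EM and has the higher ACC of the two tied candidates, even though the first candidate has better ACC and the third has better HL. Lower-priority measures only break ties, which is consistent with EM results being stronger than ACC and HL results.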

Hierarchy Of Multi-label Classifier (HOMER)
The results for the HOMER classifier are presented in Table 6. EvoImp is again superior to the others in 35 of the 36 datasets used in the experiments (97.22%) regarding the EM metric. These results corroborate the ones obtained with the Binary Relevance classifier.
Continuing with Table 6, regarding the ACC evaluation measure, EvoImp outperformed the baseline methods in 23 datasets (63.88%). The HL results show that EvoImp had the smallest classification error in 19 out of 36 datasets (52.78%). In summary, EvoImp outperformed the other methods on all performance measures for the HOMER classifier, in line with the results for the BR classifier.

Notes to the results tables: 1 "%" refers to the percentage of missing data analyzed (5%, 10%, 15%, 20%, 25%, and 30%). 2 "Db" refers to the datasets used in the experimental setup; the letter abbreviations can be found in Table 3. 3 Acronyms are related to each data imputation method tested, listed in S1 Table (Abbreviations).

Multi-Label k-Nearest Neighbors
The results obtained with the ML-KNN classifier are shown in Table 7. As can be seen, EvoImp showed performance similar to the previous scenarios with the BR and HOMER classifiers. For instance, considering the primary analyzed metric (EM), EvoImp outperformed the baseline methods in 97.22% of the datasets. Considering the ACC and HL, EvoImp presented superior performance for 20 (55.55%) and 22 (61.11%) datasets, respectively.

Classifier Chains
The results for the Classifier Chains are presented in Table 8. Again, EvoImp outperformed the baseline methods for all evaluation measures considered: EM with superiority in 32 out of 36 datasets (88.88%), ACC with 30 (83.33%), and HL with 22 (61.11%).

Ensembles of Classifier Chains
The last scenario analyzed used the Ensembles of Classifier Chains (ECC) method. The results (Table 9) also show a considerable advantage of EvoImp over its competitors in the analyses performed. However, this is the classifier for which EvoImp had its lowest performance, with numerical superiority in 29 datasets (80.55%) in the evaluation with EM, 16 (44.44%) for ACC, and 17 (47.22%) for HL. In summary, the EvoImp performance with ECC follows the same pattern described in the previous scenarios, demonstrating the robustness of EvoImp.

Discussion
In summary, EvoImp proved to be competitive in all classification scenarios, which underlines that optimizing imputation through evolutionary strategies, such as genetic algorithms, is an excellent alternative for handling missing values in the preprocessing phase of data analysis. It should be noted that the algorithm performed its optimizations starting from simple imputation methods (applied to the initial population of EvoImp). Considering the computational experiments, other factors should be highlighted regarding the EvoImp performance:
• Maximizing the labels' success: The primary purpose of classification, particularly in this study, is the correct labeling of data instances, a task that becomes increasingly complex in the multi-label scenario. In the EM measure, where the classifier must label all the classes of an instance correctly for the instance to count as correct, the proposed method achieved better performance in 92.22% of all the datasets in all the scenarios. This performance is more evident in BR, HOMER, and ML-KNN, with 35 out of the 36 datasets. Another measure that allows this conclusion is ACC. The superior performance achieved by EvoImp is more apparent in the analyses with the CC and HOMER classifiers (with 30 and 23 datasets, respectively). In general terms, EvoImp was better in 68.3% of all used datasets. This can be explained by the fact that this measure is flexible regarding the number of labels predicted correctly. For example, if an instance belongs to five labels and obtains four correct labels, it achieves an 80% degree of accuracy. At the same time, the excellent performance in ACC indicates that the classifier can increase its labeling capacity. This can be confirmed by analyzing the classification error evaluated using HL. In this metric, the proposed method obtained the lowest error in 53.33% of the datasets. It is worth mentioning that the results obtained reflect the lexicographic order chosen (as explained in subsection "Fitness function"), demonstrating the method's superiority over all the others. A comparison shows that when ACC increases, there is an automatic reduction in the HL error, justifying the usage of lexicographical order instead of more complex approaches, such as Pareto Frontier Analysis, used to deal with conflicting measures.
• Superior performance in datasets over different domains and sizes: The six datasets used in the experiments can be divided in terms of i) their domains: the multi-label datasets used were related to the areas of audio (1), music (2), image (2), and biology (1); and ii) their sizes, considering the number of instances and attributes, as was done by [54]. These datasets were curated to provide a robust experimental setup, simulating diverse real-world problems. EvoImp performed better in all the tests, indicating that the method is robust on datasets of different domains and sizes.
• Stable performance across the missing value rates under study: A critical evaluation in this study concerns the relationship between the percentage of missing values and the performance measures. The results show that EvoImp remains consistent as the missing rate varies, which, in this study, ranged from 5% to 30% (in steps of k = 5%). These rates agree with those used in most studies in the literature. One related work that addresses this discussion is [17], a review in which a total of 48 related articles from 2011 to 2021 were selected. Regarding missing rates, this review indicated that 60.4% of the articles used missing rates ≤ 30% or did not reveal the missing rates used in their experiments.
The above aspects demonstrate that EvoImp is suitable for missing value treatment in real-world scenarios.
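The missing value rates discussed in the last item (5% to 30% in steps of 5%) can be simulated on a complete dataset as sketched below. This is a minimal illustration, assuming that values are removed completely at random (MCAR) from the feature matrix only and that removed cells are marked with `None`; the function name `inject_mcar` and the toy data are hypothetical and not taken from the project's repository.

```python
import random
from typing import List, Optional

Row = List[Optional[float]]


def inject_mcar(data: List[List[float]], rate: float, seed: int = 0) -> List[Row]:
    """Return a copy of `data` in which round(rate * n_cells) cells, chosen
    uniformly at random without replacement, are replaced by None."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    n_missing = round(rate * len(cells))
    out: List[Row] = [list(row) for row in data]
    for i, j in rng.sample(cells, n_missing):
        out[i][j] = None
    return out


# Toy feature matrix with 20 instances and 5 attributes (100 cells).
data = [[float(i + j) for j in range(5)] for i in range(20)]
rates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]

# Number of missing cells produced at each rate.
missing_counts = [
    sum(value is None for row in inject_mcar(data, r) for value in row)
    for r in rates
]
```

Because each cell has the same chance of being removed regardless of its value or of any other attribute, the pattern of missingness carries no information about the data, which is the defining property of MCAR.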

Conclusion and suggestions for future work
The data analyses conducted on real-world datasets make it clear that there is a critical need to handle missing values in the multi-label classification domain. The ubiquitous presence of MVs, and the fact that most of the techniques employed only work or ensure good performance when applied to datasets with complete cases, underline the need to tackle this problem. Data imputation methods have emerged as an alternative solution, searching for plausible values to fill in the missing ones.
Therefore, we proposed in this study EvoImp, an imputation method based on genetic algorithms for the optimization of multiple imputations for missing data applied to multi-label learning. For validation, the method was submitted to an extensive experimental benchmarking process with various multi-label datasets and compared with other state-of-the-art imputation methods. Six missing value rates were applied to the datasets to simulate the MCAR mechanism. The results were analyzed using five classifiers: Binary Relevance, Hierarchy of Multi-label Classifier, Multi-Label k-Nearest Neighbors, Classifier Chains, and Ensembles of Classifier Chains. Three well-known evaluation measures were adopted to assess the experiments: Exact Match, Accuracy, and Hamming Loss.
EvoImp achieved exceptional results in all the scenarios evaluated, being quantitatively superior to the others. These outstanding results make it possible to conclude that the proposed method is suitable for application in real-world scenarios. In addition to a novel approach for dealing with MVs in multi-label classification, the present work contributes to the body of knowledge by: i) assessing the impact of missing data on multi-label classification to improve classification robustness; ii) providing an extensive experimental comparison of many state-of-the-art data imputation algorithms, multi-label machine learning classifiers, and performance measures; iii) making the source code and experimental results available in a GitHub repository.
In future work, we want to evaluate other missingness mechanisms apart from MCAR and adjust the method for handling high rates of missing data (> 30%). Experiments could also be performed to make EvoImp learn its parameters (AutoML). Finally, we would like to investigate the influence of cardinality and density characteristics on multi-label learning with missing values.

Fig 1. EvoImp's GA structure example. Toy example of a dataset with MV and how EvoImp's GA works with it. (a) Dataset with missing values; (b) A complete dataset with imputed data. (c) Phenotype: contains the values corresponding to the missing data space; Genotype: represents the genes in binary code and the values of the measurements used in the fitness function. (d) Illustration of how the initial population is initialized. (e) Random selection of parents for crossover. (f) Illustration of crossover being applied to the two selected individuals. https://doi.org/10.1371/journal.pone.0297147.g001
